Descriptive Statistics

The Importance of Visualization

Michael Luu, MPH | Marie Lauzon, MS
Biostatistics & Bioinformatics Research Center | Cedars Sinai Medical Center
September 12, 2023

Why do we need to visualize our data?

Data

x y
55.4 97.2
51.5 96.0
46.2 94.5
42.8 91.4
40.8 88.3
38.7 84.9
35.6 79.9
33.1 77.6
29.0 74.5
26.2 71.4
x y
58.2 91.9
58.2 92.2
58.7 90.3
57.3 89.9
58.1 92.0
57.5 88.1
28.1 63.5
28.1 63.6
28.1 63.1
27.6 62.8
x y
38.3 92.5
35.8 94.1
32.8 88.5
33.7 88.6
37.2 83.7
36.0 82.0
39.2 79.3
39.8 82.3
35.2 84.2
40.6 78.5
x y
56.0 79.3
50.0 79.0
51.3 82.4
51.2 79.2
44.4 78.2
45.0 77.9
48.6 78.8
42.1 76.9
41.0 76.4
34.6 72.7

Let’s begin by taking descriptive measures

dataset n mean_x sd_x mean_y sd_y
A 142 54.3 16.8 47.8 26.9
B 142 54.3 16.8 47.8 26.9
C 142 54.3 16.8 47.8 26.9
D 142 54.3 16.8 47.8 26.9


The counts (n), the means of x and y, and the standard deviations of x and y appear identical across ALL four datasets!

Can we conclude the datasets are similar or identical?

Not quite yet!

Let’s visualize the relationship of x and y

Although simple quantitative summaries are similar …

They can appear drastically different when visualized!

Datasaurus Dozen

Anscombe’s Quartet

  • The Datasaurus Dozen is a modern take on the classic “Anscombe’s Quartet”1

  • Anscombe’s Quartet comprises four datasets that have nearly identical simple summary measures, yet have very different distributions and look very different when plotted

Anscombe’s Quartet

dataset n mean_x sd_x mean_y sd_y
I 11 9.00 3.32 7.50 2.03
II 11 9.00 3.32 7.50 2.03
III 11 9.00 3.32 7.50 2.03
IV 11 9.00 3.32 7.50 2.03
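
As an illustrative check, the summary table can be recomputed with Python's standard library. This is only a sketch: the x and y values below are transcribed from the published Anscombe quartet and should be verified against a trusted copy (for example, R's built-in `anscombe` dataset), and the `summarize` helper is a hypothetical name introduced here.

```python
import statistics

# Anscombe's quartet, transcribed from the published values (verify before reuse).
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x for datasets I, II, III
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return n, mean and sample standard deviation of x and y (hypothetical helper)."""
    return {
        "n": len(x),
        "mean_x": statistics.mean(x),
        "sd_x": statistics.stdev(x),  # sample SD (n - 1 denominator)
        "mean_y": statistics.mean(y),
        "sd_y": statistics.stdev(y),
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}

for name, s in summaries.items():
    print(f"{name:>3} {s['n']} {s['mean_x']:.2f} {s['sd_x']:.2f} "
          f"{s['mean_y']:.2f} {s['sd_y']:.2f}")
```

All four rows print the same rounded values, even though the underlying points differ from dataset to dataset.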

Types of Graphical Visualizations

Dot plot

  • Useful for small to moderate sized data

  • Allows us to visualize the spread and distribution of one continuous or discrete variable

    • e.g. length of stay
  • The X axis is the variable of interest and each dot represents a single observation

  • Easy to identify the mode

  • Highlights clusters, gaps, and outliers

  • Intuitive and easy to understand
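
A minimal text-based sketch of the idea, assuming a small set of made-up length-of-stay values (in days). The `dot_plot` helper is hypothetical, and the plot is rotated so that the variable runs down the page rather than along the X axis.

```python
from collections import Counter


def dot_plot(values):
    """Build one text row per integer value, with one '*' per observation."""
    counts = Counter(values)
    rows = []
    for value in range(min(counts), max(counts) + 1):
        # Values with no observations produce an empty row, which shows gaps.
        rows.append(f"{value:>3} | {'*' * counts[value]}".rstrip())
    return rows


# Hypothetical length-of-stay data (days), for illustration only.
length_of_stay = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 9]

rows = dot_plot(length_of_stay)
print("\n".join(rows))

# The mode is the value with the tallest stack of dots.
stay_counts = Counter(length_of_stay)
mode = max(stay_counts, key=stay_counts.get)
print("mode:", mode)  # mode: 3
```

In the printed output, the stack at 3 is the mode, the empty rows at 6 to 8 are a gap, and the single dot at 9 is a potential outlier.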

Histogram

  • Useful for data of any size (small or large)

  • Allows us to visualize the spread and distribution of continuous variables

  • Each bar represents a ‘bin’ or a defined interval of values

  • Although not as common, the width of the bins does NOT have to be equal!

  • The y axis (the height of each bar) represents the number of values that fall into each bin

  • The y axis is also commonly rescaled to ‘relative’ frequencies or densities, showing the proportion of cases that fall into each bin.
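
The binning described above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the `histogram` helper, the example values, and the bin edges are all made up, bins are treated as left-closed (with the last bin also including its right edge), and density is taken as proportion divided by bin width so that unequal bins remain comparable.

```python
from bisect import bisect_right


def histogram(values, edges):
    """Return (counts, proportions, densities) for bins defined by sorted edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # ignore values outside the binned range
        # Bin i covers [edges[i], edges[i + 1]); the last bin also includes its right edge.
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    total = sum(counts)
    proportions = [c / total for c in counts]
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    densities = [p / w for p, w in zip(proportions, widths)]
    return counts, proportions, densities


# Hypothetical data and deliberately unequal bin widths (2, 2, 2, 8).
values = [1, 2, 2, 3, 3, 3, 4, 5, 8, 12]
edges = [0, 2, 4, 6, 14]

counts, proportions, densities = histogram(values, edges)
print(counts)       # [1, 5, 2, 2]
print(proportions)  # [0.1, 0.5, 0.2, 0.2]
print(densities)    # the wide last bin gets a much lower density than its count suggests
```

With unequal bins, plotting densities rather than raw counts keeps the area of each bar proportional to the share of observations it contains.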

Distribution

“A distribution is simply a collection of data, or scores, on a variable. Usually, these scores are arranged in order from smallest to largest and then they can be presented graphically.”1

Normal Distribution

Univariate Continuous Distributions

Univariate Discrete Distributions

Scatter plot

  • Used to visualize the relationship between two continuous variables

  • Useful for detecting patterns that are obscured by quantitative summaries, as we observed in Anscombe’s Quartet and the Datasaurus Dozen.

Bar plot

  • Useful for visualizing categorical data

  • Commonly used to present the count and proportion of each level

  • Allows us to quickly observe the difference in magnitude of each level based on the height of each bar
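
As a rough sketch, the counts and proportions behind a bar plot can be tabulated with `collections.Counter`. The `level_summary` helper and the admission-type data are hypothetical, and the text bars only stand in for a real plot.

```python
from collections import Counter


def level_summary(labels):
    """Map each level to (count, proportion), most frequent level first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {level: (n, n / total) for level, n in counts.most_common()}


# Hypothetical categorical variable, for illustration only.
admission_type = ["elective"] * 5 + ["emergency"] * 3 + ["urgent"] * 2

summary = level_summary(admission_type)
for level, (n, proportion) in summary.items():
    # One '#' per observation plays the role of the bar height.
    print(f"{level:<10} {'#' * n} {n} ({proportion:.0%})")
```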

However…

Bar plots are commonly misused!

How NOT to Bar Plot

  • Although prevalent in the literature, bar plots should NOT be used to describe the mean and dispersion of continuous data

  • Only shows one arm of the error bar, making overlap comparisons difficult

  • Promotes the misconception that the mean is related to the height of the bar rather than the position of its top

  • Obscures the distribution and spread of the data

Box plot

  • Useful for describing continuous variables following a unimodal distribution
    • i.e. a single peak
  • The box is representative of common quantitative measures
    • Top of box is the 75th percentile (third quartile, Q3)
    • Middle dash inside box is the 50th percentile (median)
    • Bottom of box is the 25th percentile (first quartile, Q1)
    • Height of the box is the interquartile range (IQR = Q3 - Q1)
  • The ‘whiskers’ extend to the most extreme observations inside artificial ‘fences’ that help identify potential outliers in the data
    • The fences are defined as Q1 - 1.5*IQR and Q3 + 1.5*IQR
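
The box plot quantities above can be computed directly. The sketch below assumes the "inclusive" quartile method of Python's `statistics.quantiles`; other software uses other quartile definitions and may give slightly different boxes. The `box_plot_summary` helper and the data are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Compute quartiles, IQR, fences, whisker ends, and potential outliers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations that are still inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical data with one unusually large value.
summary = box_plot_summary([2, 3, 3, 4, 4, 5, 5, 6, 7, 21])
print(summary["q1"], summary["median"], summary["q3"])    # 3.25 4.5 5.75
print(summary["lower_fence"], summary["upper_fence"])     # -0.5 9.5
print(summary["whisker_low"], summary["whisker_high"])    # 2 7
print(summary["outliers"])                                # [21]
```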

What are some of the problems with a box plot?

They are based on quantitative summaries!

Violin plot

  • Violin plots combine a box plot with a mirrored density curve (a smoothed histogram) of the data

  • More informative than a simple box plot

  • Visualizes the full distribution of the data

  • Especially useful for bimodal or multimodal distributions

    • i.e. distributions of data with multiple peaks

How are violin plots made?
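
One common recipe is to estimate the density of the data with a kernel density estimate (KDE) and then mirror that curve around a central axis. The sketch below assumes a Gaussian kernel and a hand-picked bandwidth of 0.3; plotting libraries usually choose the bandwidth automatically. The function names and the data are hypothetical.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function estimating the density at x with a Gaussian kernel."""
    n = len(values)
    scale = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Each observation contributes a small bell curve centred on itself.
        return scale * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


def violin_outline(values, bandwidth, grid):
    """Return (position, left_edge, right_edge) tuples: the density mirrored around 0."""
    density = gaussian_kde(values, bandwidth)
    return [(x, -density(x), density(x)) for x in grid]


# Hypothetical bimodal data: one cluster near 1 and another near 5.
values = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
density = gaussian_kde(values, bandwidth=0.3)

grid = [i * 0.5 for i in range(13)]  # 0.0, 0.5, ..., 6.0
outline = violin_outline(values, 0.3, grid)
for x, left, right in outline:
    # A crude text violin: a wider row means a higher estimated density.
    print(f"{x:>4.1f} {'#' * round(right * 40)}")
```

The two wide regions in the printed outline correspond to the two peaks, which a box plot of the same data would not reveal.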

Summary

  • One continuous variable
    • Dot plot
    • Histogram
    • Box plot
    • Violin plot
  • One or more categorical variables
    • Bar plot
  • Two continuous variables
    • Scatter plot
  • One continuous variable by a categorical variable
    • Dot plot
    • Box plot
    • Violin plot

Descriptive summaries are useful, however …

Don’t forget to visualize your data!

Questions